LATENT DEMOGRAPHIC PROFILE ESTIMATION IN
HARD-TO-REACH GROUPS 

By Tyler H. McCormick and Tian Zheng

University of Washington and Columbia University 

The sampling frame in most social science surveys excludes members 
of certain groups, known as hard-to-reach groups. These groups, or sub- 
populations, may be difficult to access (the homeless, e.g.), camouflaged 
by stigma (individuals with HIV/AIDS), or both (commercial sex work- 
ers). Even basic demographic information about these groups is typically 
unknown, especially in many developing nations. We present statistical 
models which leverage social network structure to estimate demographic 
characteristics of these subpopulations using Aggregated relational data 
(ARD), or questions of the form "How many X's do you know?" Unlike 
other network-based techniques for reaching these groups, ARD require no 
special sampling strategy and are easily incorporated into standard surveys. 
ARD also do not require respondents to reveal their own group member- 
ship. We propose a Bayesian hierarchical model for estimating the demo- 
graphic characteristics of hard-to-reach groups, or latent demographic pro- 
files, using ARD. We propose two estimation techniques. First, we propose 
a Markov-chain Monte Carlo algorithm for existing data or cases where the 
full posterior distribution is of interest. For cases when new data can be 
collected, we propose guidelines and, based on these guidelines, propose a 
simple estimate motivated by a missing data approach. Using data from 
McCarty et al. [Human Organization 60 (2001) 28-39], we estimate the age 
and gender profiles of six hard-to-reach groups, such as individuals who 
have HIV, women who were raped, and homeless persons. We also evaluate 
our simple estimates using simulation studies. 

1. Introduction. Standard surveys often exclude members of certain
groups, known as hard-to-reach groups. One reason these individuals are
excluded is difficulty accessing group members. Persons who are homeless are
very unlikely to be reached by a survey which uses random-digit dialing,
for example. Other individuals can be accessed using standard survey tech- 
niques, but are excluded because of issues in reporting. Members of these 
groups are often reluctant to self-identify because of social pressure or stigma 
[Shelley et al. (1995)]. Individuals who are homosexual, for example, may 
not be comfortable revealing their sexual preferences to an unfamiliar sur- 
vey enumerator. A third group of individuals is difficult to reach because of 
issues with both access and reporting (commercial sex workers, e.g.). 

Even basic demographic information about these groups is typically un- 
known, especially in developing nations. We propose a Bayesian hierarchi- 
cal model for estimating the demographic characteristics of hard-to-reach 
groups, or latent demographic profiles. Specifically, these profiles reveal fea- 
tures such as the number of males in a certain age range, say, 20-30 years 
old, who have HIV. Sociologically, this information yields insights into the 
characteristics of some of the most socially isolated members of the
population. Along with its contribution to our understanding of contemporary
social institutions, estimating demographic profiles for these groups also has 
public health benefits. The distribution of infected individuals influences the 
size of the public health response. UNAIDS, the joint United Nations
program on HIV/AIDS, for example, currently sponsors several projects using
a variety of techniques to estimate the sizes of populations most at-risk for 
HIV/AIDS [UNAIDS (2003)]. The proposed method would, along with esti- 
mating the size of the population, provide latent demographic profiles. This 
information would not only help calibrate the scale of the response but also 
tailor programs to the specific needs of population members. 

One approach to estimating demographic information about hard-to-reach 
groups is to reach members of these groups through their social network. 
Some network-based approaches, such as Respondent-driven Sampling (RDS), 
recruit respondents directly from other respondents' networks [Heckathorn 
(1997, 2002)], making the sampling mechanism similar to a stochastic
process on the social network [Goel and Salganik (2009)]. RDS affords
researchers face-to-face contact with members of hard-to-reach groups,
facilitating exhaustive interviews and even genetic or medical testing. The price
for an entree to these groups is high, however, as RDS uses a specially de- 
signed link-tracing framework for sampling. Estimates from RDS are also bi- 
ased because of the network structure captured during selection, with much 
statistical work surrounding RDS being intended to reweight observations 
from RDS to have properties resembling a simple random sample. 

Another approach is Aggregated relational data (ARD) or "How many 
X's do you know" questions [Killworth et al. (1998a)]. In these questions,
"X" defines a population of interest (e.g., How many people who are
homeless do you know?). A specific definition of "know" defines the network the
respondent references when answering the question. In contrast to RDS,
ARD do not require reaching members of the hard-to-reach groups directly.
Instead, ARD access hard-to-reach groups indirectly through the social
networks of respondents on standard surveys. ARD never afford direct access
to members of hard-to-reach populations, making the level of detail
achievable through RDS impossible with ARD. Unlike RDS, however, ARD require
no special sampling techniques and are easily incorporated into standard sur- 
veys. ARD are, therefore, feasible for a broader range of researchers across 
the social sciences, public health, and epidemiology to implement with sig- 
nificantly lower cost than RDS. 

In this paper, we propose a model for estimating latent demographic 
profiles using ARD. The ease of implementation of ARD means that the 
models proposed here will make the demographic characteristics of hard-to- 
reach groups available to the multitude of researchers collecting data using 
standard survey methodology. Specifically, we propose a Bayesian hierarchi- 
cal model for estimating the demographic characteristics of hard-to-reach 
groups using ARD. When the full posterior is of interest, we propose a 
Markov-chain Monte Carlo algorithm. 

Given the ease of collecting ARD, we speculate that many researchers 
may be interested in including ARD questions on future surveys. In this 
case, we show that estimates for some network features very close to those 
achieved using MCMC can be obtained using significantly simpler estima- 
tion techniques under certain survey design conditions. Along with giving 
survey guidelines, we propose a simpler estimation technique based on the 
EM algorithm and regression. Using data from McCarty et al. (2001), we 
estimate the age and gender profiles of six hard-to-reach groups, such as 
individuals who have HIV, women who were raped, and homeless persons. 

In Section 2 we contextualize our proposed method by reviewing previous 
statistical methods for estimating network features using ARD. Then, in
Section 3, we describe a method for estimating demographic profiles from
hard-to-reach populations. Section 4 illustrates our method using data from
McCarty et al. (2001). After demonstrating the utility of our model, Section 5 describes
how, under certain survey design conditions, we can obtain similar estimates 
without the computational sophistication required by MCMC. 

2. Previous research on ARD. ARD are commonly used to estimate the 
size of populations that are difficult to count directly. The scale-up method, 
an early method for ARD, uses ARD questions where the subpopulation 
size is known (people named Michael, e.g.) to estimate degree in a straight- 
forward manner. Suppose that you know two persons named Nicole, and 
that at the time of the survey, there were 358,000 Nicoles out of 280 mil- 
lion Americans. Thus, your two Nicoles represent a fraction (2/358,000) of 
all the Nicoles. Extrapolating to the entire country yields an estimate of 
(2/358,000) × (280 million) ≈ 1560 people known by you. Then, the size of
unknown subpopulations is estimated by solving the given equation for the
unknown subpopulation size with the estimated degree. Using this method,
ARD have been used extensively to estimate the size of populations such as
those with HIV/AIDS, injection drug users, or the homeless [e.g., Killworth 
et al. (1990, 1998b)]. 
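
The Nicole calculation above can be written out as a short Python sketch. The function names `scale_up_degree` and `scale_up_group_size` are illustrative and do not come from the cited work; the figures are those quoted in the text, and the count of one known member of a hidden group is purely hypothetical.

```python
def scale_up_degree(known_count, subpop_size, population_size):
    """Estimate a respondent's degree from one subpopulation of known size."""
    return known_count / subpop_size * population_size


def scale_up_group_size(known_count, degree, population_size):
    """Solve the same proportion for an unknown subpopulation size."""
    return known_count / degree * population_size


# Figures from the text: 2 Nicoles known, 358,000 Nicoles, 280 million Americans.
degree = scale_up_degree(2, 358_000, 280_000_000)
print(round(degree))  # 1564, which the text rounds to 1560

# Hypothetical: the same respondent reports knowing 1 member of a hidden group.
hidden_size = scale_up_group_size(1, degree, 280_000_000)
print(round(hidden_size))
```

In practice the degree estimate would be pooled over many known subpopulations and many respondents before being used to size an unknown group; the single-respondent version here only illustrates the arithmetic.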

The scale-up method is easy to implement but does not account for net- 
work structure. Consider, for example, asking a respondent how many people 
named "Rose" she/he knows. If knowing someone named Rose were entirely 
random, then each respondent would be equally likely to know each of the 
one-half million Roses on the hypothetical list; that is, each pairing of a
respondent with a Rose is a Bernoulli trial with a fixed success probability.
Network structure makes these types of independence assumptions invalid. Since Rose is
most common among older females and people are more likely to know in- 
dividuals of similar age and the same gender, older female respondents are 
more likely to know a given Rose than older male respondents. Assuming 
independent responses induces bias in the individuals' responses. Since es- 
timates of hard-to-count populations are then constructed using responses 
to ARD, the resulting estimates are also biased [Killworth et al. (1998a), 
Bernard et al. (1991)]. 

Zheng, Salganik and Gelman (2006) and McCormick, Salganik and Zheng 
(2010) propose hierarchical models for ARD which partially address the 
manifestations of network structure present in ARD. McCormick, Salganik 
and Zheng (2010) develop a model specifically for estimating respondents' 
degree (network size) and population degree distribution. Though this model 
accounts for the network structure described in the above example, Mc- 
Cormick, Salganik and Zheng (2010) do not address hard-to-reach groups. 
Zheng, Salganik and Gelman (2006) present a model which estimates the 
sizes of hard-to-reach groups [see Figure 5 in Zheng, Salganik and
Gelman (2006)]. This paper presents a model which provides richer information
about hard-to-reach groups by estimating both subpopulation sizes and the 
demographic breakdown of individuals within these groups. 

3. Estimating latent profiles. In this section we describe a model for es- 
timating latent demographic profiles for hard-to-reach groups. This method 
will provide information about the demographic makeup of groups which are 
often difficult to access using standard surveys, such as the proportion of 
young males who are infected with HIV. The observations, $y_{ik}$, represent the
number of individuals in subpopulation k known by respondent i. In ARD, 
respondents are conceptualized as egos, or senders of ties in the network. 
We divide the egos into groups based on their demographic characteristics 
(males 20-40 years old, e.g.). The individuals who comprise the counts for 
ARD are the alters, or recipients of links in the network. The alters are 
also divided into groups, though the groups need not be the same for both
the ego and the alter groups. Under this setup, members of hard-to-reach
groups are one type of alter. Thus, determining the alter groups determines 
the demographic characteristics of the hard-to-reach groups which can be 
estimated. We model the number of people that respondent i is connected 
to in subpopulation k as 

$$
y_{ik} \sim \text{Neg-Binom}(\mu_{ike}, \omega'_k),
\quad \text{where } \mu_{ike} = d_i \sum_{a=1}^{A} m(e,a)\, h(a,k), \tag{3.1}
$$

and $\omega'_k$ represents the variation in the relative propensity of respondents
within an ego group to form ties with individuals in a particular
subpopulation k. The degree of person i is $d_i$ and e is the ego group that person i
belongs to. The $h(a,k)$ term is the relative size of subpopulation group k within
alter group a (e.g., 4% of males between ages 21 and 40 are named Michael).
The mixing coefficient, $m(e,a)$, for a respondent with degree $d_i = \sum_{a=1}^{A} d_{ia}$
between ego-group e and alter-group a is

$$
m(e,a) = E\left( \frac{d_{ia}}{d_i} \,\middle|\, i \text{ in ego group } e \right),
$$

where $d_{ia}$ is the number of person i's acquaintances in alter group a. That is,
$m(e,a)$ represents the expected fraction of the ties of someone in ego-group
e that go to people in alter-group a. For any group e, $\sum_{a=1}^{A} m(e,a) = 1$.

The vector of mixing rates for an ego group, $(m(e,1), \ldots, m(e,A))$, enters
the likelihood via an inner product with $h(a,k)$; therefore, its components
are only identifiable if the A by K matrix of $h(a,k)$ terms, $H_{A \times K}$, has
rank A. This condition requires $K \geq A$ and that the columns of $H_{A \times K}$ not
be perfectly correlated. When all elements of $H_{A \times K}$ are fixed, then (3.1)
is the LRNM model from McCormick, Salganik and Zheng (2010). Specifically,
McCormick, Salganik and Zheng (2010) propose asking ARD questions
about populations where the elements of $h(a,k)$ are readily available, such
as first names in the United States population. When $h(a,k)$ is known, it
is simply $N_{ak}/N_a$, or the number of individuals in alter group a who have
characteristic k divided by the number of individuals in alter group a.
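
The rank condition can be checked directly for any candidate set of known subpopulations. The following Python sketch is only an illustration: the helper names `matrix_rank` and `mixing_identifiable` are hypothetical, the profile values are invented, and exact fractions are used so that the rank is not affected by floating-point rounding.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


def mixing_identifiable(h):
    """True when the A-by-K matrix h has rank A (rows index alter groups)."""
    return matrix_rank(h) == len(h)


# Hypothetical h(a, k): rows are alter groups (male, female), columns are names.
male_names_only = [["0.04", "0.02", "0.01"],
                   ["0", "0", "0"]]
mixed_names = [["0.04", "0.02", "0"],
               ["0", "0", "0.03"]]

print(mixing_identifiable(male_names_only))  # False
print(mixing_identifiable(mixed_names))      # True
```

With only male names, the female row of the matrix is zero, so the data carry no information about ties sent to females and the mixing rates cannot be separated.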

In hard-to-reach groups, $h(a,k)$ is rarely known. In many cases, even the
number of individuals in a hard-to-reach group, $N_k$, is unknown. In the
following section we propose a method for estimating $h(a,k)$ for
hard-to-reach groups using information from groups where $h(a,k)$ is available. This
method provides information beyond the size of the subpopulation group,
also estimating the number of individuals in each of the A alter groups, $N_{ak}$.

In summary, the number of people that person i knows in subpopulation k, 
given that person i is in ego-group e, is based on person i's degree ($d_i$),
the proportion of people in alter-group a that belong to subpopulation k
($h(a,k)$), and the mixing rate between people in group e and people in
group a ($m(e,a)$). Additionally, if we observe random mixing, then
$m(e,a) = N_a/N$.
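
To make the summary concrete, the sketch below computes the expected count for one respondent and one subpopulation as degree times the mixing-weighted sum of profile entries. It is a minimal illustration under stated assumptions: the function name `expected_count`, the alter-group sizes, the subpopulation counts, and the mixing rates are all hypothetical. It also checks numerically that, under random mixing, the expected count collapses to the respondent's degree times the subpopulation's share of the population.

```python
def expected_count(degree, mixing_row, profile_column):
    """mu_ike = d_i * sum_a m(e, a) * h(a, k) for one respondent and one group."""
    if len(mixing_row) != len(profile_column):
        raise ValueError("mixing_row and profile_column must have length A")
    return degree * sum(m * h for m, h in zip(mixing_row, profile_column))


# Hypothetical alter-group sizes N_a and subpopulation counts N_ak (A = 3).
alter_sizes = [50_000, 30_000, 20_000]
subpop_counts = [2_000, 600, 100]
total = sum(alter_sizes)

# Known profile: h(a, k) = N_ak / N_a.
h_column = [n_ak / n_a for n_ak, n_a in zip(subpop_counts, alter_sizes)]

# Nonrandom mixing: this ego group sends most ties to the first alter group.
m_row = [0.7, 0.2, 0.1]
print(expected_count(300, m_row, h_column))

# Random mixing, m(e, a) = N_a / N, collapses to d_i * N_k / N.
random_row = [n_a / total for n_a in alter_sizes]
print(expected_count(300, random_row, h_column))
print(300 * sum(subpop_counts) / total)
```

The gap between the first and second printed values is the kind of difference that a random-mixing assumption ignores.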

Similar to Zheng, Salganik and Gelman (2006), a negative binomial model
is assumed in (3.1) for each $y_{ik}$, with an overdispersion parameter, $\omega'_k$,
measuring the residual relative propensity of respondents to form ties with
individuals in group k, controlling for the variations that define the ego groups.

3.1. Latent demographic profiles. We propose a two-stage estimation
procedure. We first use a multilevel model and Bayesian inference to
estimate $d_i$, $m(e,a)$, and $\omega'_k$ using the latent nonrandom mixing model
described in McCormick, Salganik and Zheng (2010) for the subpopulations
where $h(a,k) = N_{ak}/N_a$ is known. Second, conditional on this information,
we estimate the latent profiles for the remaining subpopulations. 

For the estimation of the McCormick, Salganik and Zheng (2010) model
components, we assume that $\log(d_i)$ follows a normal distribution with mean
$\mu_d$ and standard deviation $\sigma_d$. Zheng, Salganik and Gelman (2006)
postulate that this prior should be reasonable based on previous work,
specifically McCarty et al. (2001), and found that the sampler described using
this prior mixed well and satisfied posterior predictive checks. McCormick,
Salganik and Zheng (2010) also conducted simulation experiments which
demonstrated that the shape of the posterior was not an artifact of this
prior assumption. We estimate a value of $m(e,a)$ for all E ego groups and
all A alter groups. For ego group e and alter group a, we assume that
$m(e,a)$ has a normal prior distribution with mean $\mu_{m(e,a)}$ and standard
deviation $\sigma_{m(e,a)}$. For $\omega'_k$, we use independent uniform(0, 1) priors on the
inverse scale, $p(1/\omega'_k) \propto 1$. Since $\omega'_k$ is constrained to $(1, \infty)$, the inverse falls
on (0, 1). The Jacobian for the transformation is $(\omega'_k)^{-2}$. For the latent
profiles, define $\iota_{h(a,k)}$ as the indicator of the latent profiles. The matrix $h(a,k)$
is defined as $N_{ak}/N_a$ when population information is available ($\iota_{h(a,k)} = 0$),
and entries to be estimated ($\iota_{h(a,k)} = 1$) are given normal priors on the
log scale with mean $\mu_h$ and standard deviation $\sigma_h$. That is, we model each
$\log(h(a,k)) \sim N(\mu_h, \sigma_h)$ with a common mean and variance for all entries
in the latent profile matrix. Since many of the profiles are close to zero,
we found that the additional structure from a common prior across all
entries improved convergence without being too rigid to capture fluctuations
in latent intensity. Finally, we give noninformative uniform priors to the
hyperparameters $\mu_d$, $\mu_{m(e,a)}$, $\mu_h$, $\sigma_d$, $\sigma_{m(e,a)}$ and $\sigma_h$. The joint posterior density
can then be expressed as

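$$
\begin{aligned}
& p(d, m(e,a), \omega', \mu_d, \mu_{m(e,a)}, \sigma_d, \sigma_{m(e,a)} \mid y) \\
&\quad \propto \prod_{k=1}^{K} \prod_{i=1}^{n} \binom{y_{ik} + \xi_{ik} - 1}{\xi_{ik} - 1}
   \left(\frac{1}{\omega'_k}\right)^{\xi_{ik}} \left(\frac{\omega'_k - 1}{\omega'_k}\right)^{y_{ik}} \\
&\qquad \times \prod_{k=1}^{K} \left(\frac{1}{\omega'_k}\right)^{2} \prod_{i=1}^{n} N(\log(d_i) \mid \mu_d, \sigma_d) \\
&\qquad \times \prod_{e=1}^{E} N(m(e,a) \mid \mu_{m(e,a)}, \sigma_{m(e,a)}) \\
&\qquad \times \iota_{h(a,k)} \prod_{k=1}^{K} \prod_{a=1}^{A} N(h(a,k) \mid \mu_{h(a,k)}, \sigma_{h(a,k)}),
\end{aligned}
$$

where $\xi_{ik} = d_i \left(\sum_{a=1}^{A} m(e,a)\, h(a,k)\right) / (\omega'_k - 1)$.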
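
The negative binomial term can be evaluated numerically as in the sketch below. It assumes the mean and overdispersion parameterization in which the variance equals the overdispersion times the mean, so that the size parameter is the mean divided by (overdispersion minus one) and the success probability is the inverse overdispersion. The function names `neg_binom_logpmf` and `log_prior_omega` and the numerical values are illustrative only.

```python
import math


def neg_binom_logpmf(y, mu, omega):
    """Log-probability of count y with mean mu and overdispersion omega > 1."""
    if omega <= 1:
        raise ValueError("omega must exceed 1")
    xi = mu / (omega - 1.0)  # size parameter: mean divided by (omega - 1)
    log_binom = math.lgamma(y + xi) - math.lgamma(xi) - math.lgamma(y + 1)
    return log_binom - xi * math.log(omega) + y * math.log((omega - 1.0) / omega)


def log_prior_omega(omega):
    """Uniform(0, 1) prior on 1/omega implies a density proportional to omega^-2."""
    return -2.0 * math.log(omega) if omega > 1 else -math.inf


# Numerical check with hypothetical values: the pmf should sum to one,
# have mean mu, and have variance omega * mu.
mu, omega = 4.0, 2.5
probs = [math.exp(neg_binom_logpmf(y, mu, omega)) for y in range(400)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(round(sum(probs), 6), round(mean, 6), round(var, 6))
```

Working on the log scale with `math.lgamma` avoids overflow in the binomial coefficient when the size parameter is not an integer, which is the usual case here.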

Adapting Zheng, Salganik and Gelman (2006) and McCormick, Salganik
and Zheng (2010), we use a Gibbs-Metropolis algorithm in each iteration v:

(1) For each i, update $d_i$ using a Metropolis step with jumping
distribution $\log(d_i^*) \sim N(d_i^{(v-1)}, (\text{jumping scale of } d_i)^2)$.

(2) For each e, update the vector $m(e,\cdot)$ using a Metropolis step. Define
the proposed value using a random direction and jumping rate. Each of
the A elements of $m(e,\cdot)$ has a marginal jumping distribution
$m(e,a)^* \sim N(m(e,a)^{(v-1)}, (\text{jumping scale of } m(e,\cdot))^2)$. Then, rescale so that the row
sum is one.

(3) Update $\mu_d \sim N(\hat{\mu}_d, \sigma_d^2/n)$, where $\hat{\mu}_d = \frac{1}{n} \sum_{i=1}^{n} d_i$.

(4) Update $\sigma_d^2 \sim \text{Inv-}\chi^2(n-1, \hat{\sigma}_d^2)$, where $\hat{\sigma}_d^2 = \frac{1}{n} \times \sum_{i=1}^{n} (d_i - \mu_d)^2$.

(5) Update $\mu_{m(e,a)} \sim N(\hat{\mu}_{m(e,a)}, \sigma_{m(e,a)}^2/A)$, for each e, where
$\hat{\mu}_{m(e,a)} = \frac{1}{A} \sum_{a=1}^{A} m(e,a)$.

(6) Update $\sigma_{m(e,a)}^2 \sim \text{Inv-}\chi^2(A-1, \hat{\sigma}_{m(e,a)}^2)$, for each e, where
$\hat{\sigma}_{m(e,a)}^2 = \frac{1}{A} \times \sum_{a=1}^{A} (m(e,a) - \mu_{m(e,a)})^2$.

(7) For each k with a known profile, update $\omega'_k$ using a Metropolis step
with jumping distribution $\omega_k^{\prime *} \sim N(\omega_k^{\prime (v-1)}, (\text{jumping scale of } \omega'_k)^2)$.

We now proceed to estimate the H latent profiles:

(8) For each element of $h(a,k)$ where $\iota_{h(a,k)} = 1$, update $h(a,k)$ using a
Metropolis step with jumping distribution
$h(a,k)^* \sim N(h(a,k)^{(v-1)}, (\text{jumping scale of } h(a,k))^2)$.

(9) Update $\mu_h \sim N(\hat{\mu}_h, \sigma_h^2/(A \times H))$ for each k, where
$\hat{\mu}_h = \frac{1}{A \times H} \sum_{k=1}^{H} \sum_{a=1}^{A} h(a,k)$.

(10) Update $\sigma_h^2 \sim \text{Inv-}\chi^2((A \times H) - 1, \hat{\sigma}_h^2)$, where
$\hat{\sigma}_h^2 = \frac{1}{A \times H} \sum_{k=1}^{H} \sum_{a=1}^{A} (h(a,k) - \mu_h)^2$.

(11) For each k where $h(a,k)$ is estimated, update $\omega'_k$ using a Metropolis
step with jumping distribution $\omega_k^{\prime *} \sim N(\omega_k^{\prime (v-1)}, (\text{jumping scale of } \omega'_k)^2)$.
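
The two kinds of update in the list above, a random-walk Metropolis step and a conjugate draw for a common mean and variance, can be sketched generically in Python. This is not the authors' sampler: the function names are hypothetical, the proposal is centered at the current value with a fixed jumping scale, the sum-to-one rescaling of the mixing rows is not shown, and a standard normal target stands in for a real conditional posterior.

```python
import math
import random


def metropolis_step(current, log_post, scale, rng):
    """One random-walk Metropolis update with a symmetric normal proposal."""
    proposal = rng.gauss(current, scale)
    log_ratio = log_post(proposal) - log_post(current)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return current


def draw_scaled_inv_chi2(df, scale_sq, rng):
    """Draw from Inv-chi^2(df, scale_sq) as df * scale_sq / chi^2_df."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return df * scale_sq / chi2


def update_hyperparameters(values, sigma_sq, rng):
    """Gibbs draws for a common mean and variance, in the style of steps (3)-(4)."""
    n = len(values)
    mu_hat = sum(values) / n
    mu = rng.gauss(mu_hat, math.sqrt(sigma_sq / n))
    sigma_hat_sq = sum((v - mu) ** 2 for v in values) / n
    return mu, draw_scaled_inv_chi2(n - 1, sigma_hat_sq, rng)


# Toy usage: sample a standard normal target, which stands in for the
# conditional posterior of a single parameter.
rng = random.Random(0)
x, draws = 0.0, []
for _ in range(20000):
    x = metropolis_step(x, lambda v: -0.5 * v * v, 1.0, rng)
    draws.append(x)
print(sum(draws) / len(draws))
```

In a full sampler the `log_post` argument would be the log of the joint posterior viewed as a function of the single parameter being updated, and the jumping scales would be tuned during warm-up to reach a reasonable acceptance rate.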

Having $h(a,k)$ for some subpopulations is critical to estimating latent
structure through latent profiles. Often, $h(a,k)$ can be obtained from publicly
available sources (Census Bureau, Social Security Administration, etc.) for
subpopulations such as first names. The number of populations with known
$h(a,k)$ impacts the precision of the estimates for subpopulations with
unknown profiles. Adding another known subpopulation increases the
hypothetical sample size of each question, in essence asking each respondent if
they know more alters. McCormick, Salganik and Zheng (2010) show that
the total size of the subpopulations asked is related to the variance of
estimated degree. Since known subpopulations are used to estimate degree,
adding another subpopulation impacts variability in degree estimation in
the first stage of our procedure, which propagates to estimates of $h(a,k)$.
The alter groups where information is available for known $h(a,k)$ also limit
the type of latent structure that can be estimated. McCormick, Salganik 
and Zheng (2010) create alter groups based on age and gender but note that 
separating alters based on other factors (such as race) would provide valu- 
able information. The Census Bureau collects the information required to 
conduct such an analysis; however, McCormick, Salganik and Zheng (2010) 
report that their efforts to obtain the data were ultimately unsuccessful. 

The choice of populations with known $h(a,k)$ is also important in ensuring
that the mixing matrix is estimated appropriately. First, the subpopulations
with known sizes need to be sufficiently heterogeneous with respect to their
interactions with the ego groups to adequately estimate the mixing matrix.
If, for example, our mixing matrix consists of only gender and we chose
to use first names for subpopulations with known $h(a,k)$, then we should
use a set of both male and female names. If we asked about only male names,
then we could estimate the propensity for males/females to interact with 
males but could not estimate the propensity of either gender to interact 
with females. Second, we make an assumption about the representativeness 
of respondents' networks rather than of the respondents themselves. For our 
method it would not be an issue, for example, if we recruit a smaller fraction 
of men into the survey than the proportion of men in the population. Instead, 
we would encounter bias if the networks of the men we selected were not 
representative of male networks in the population. This could happen, for 
example, if we recruit only men who know a disproportionately large number 
of women. This issue could also be exacerbated by differential nonresponse. 
Consider, for example, the case where individuals who know members of 
the hard-to-reach groups are less likely to answer questions than the general 
population. We continue this discussion in Section 4 where we postulate 
that errors in the estimates obtained in our data could be from bias in the 
estimates of the mixing matrix.

Finally, certain types of bias which are consistently associated with ARD
should also be considered when selecting the subpopulations with known
$h(a,k)$. We assume, for example, that the responses are free from
transmission error, which occurs when a respondent knows a member of a subpopulation but is
unaware of the alter's membership. McCormick, Salganik and Zheng (2010) 
suggest using first names since they represent the minimum conceivable pos- 
sibility of transmission error. We also assume that respondents accurately 
recall the number of individuals they know in a given subpopulation. In re- 
ality, underestimation is common in large groups [see McCormick and Zheng 
(2007) for a detailed discussion]. 

4. Results for hard-to-count populations. We use data from a telephone 
survey by McCarty et al. (2001) with 1375 respondents and twelve names 
with known demographic profiles. These data have been analyzed in sev- 
eral previous studies and are typical ARD which are becoming increasingly 
common. The age and gender profiles of the names are available from the 
Social Security Administration. On this survey, "know" is defined as "that you
know them and they know you by sight or by name, that you could con- 
tact them, that they live within the United States, and that there has been 
some contact (either in person, by telephone, or mail) in the past 2 years." 
We then estimate latent profiles for seven subpopulations. Six are groups 
often considered hard-to-count while the seventh uses ARD to learn about 
population social structure. Figure 1 displays the latent profiles for six pop- 
ulations which are often described as hard-to-count. For both individuals 
with HIV and those with AIDS we estimate the highest concentration to be 
among youth and young adult respondents. We estimate a higher concen- 
tration of young adult males than females for both HIV and AIDS with the 
concentration decreasing with age. 

Fig. 1. Estimates of latent profiles for six hard-to-reach populations. The lighter text
represents males and the darker text females. Letters correspond to posterior medians, while
lines represent the width of the middle half of the posterior distribution. The estimated
profiles are consistent with contemporary understanding of the profiles of these groups.

Subpopulations such as victims of homicide or persons who have
committed suicide illustrate a key advantage of using ARD for measuring these
populations. Our model estimates characteristics of these populations
without requiring members of these populations to be reached directly through
our survey. We compared our estimates of the number of individuals
murdered in the past year with the 1999 Uniform Crime Reports (UCR)
[Federal Bureau of Investigation (1999)] and figures from the Centers for Disease
Control National Center for Injury Prevention and Control (CDC) [Centers
for Disease Control (2011)]. A technical distinction between the two sources
for external validation is that the CDC figures measure homicides (killing of
another person) while UCR tally murders (unlawful killing of another
person). So-called justifiable homicides (police officers using deadly force, e.g.)
are therefore not counted in the UCR figures. This distinction accounts for
part of the discrepancy between the two data sources (the FBI only keeps
records on firearms-related justifiable homicides), though the exact amount
could not be determined from available data. Also, the McCarty et al. (2001)
survey took place partially in January of 1999 and partially in June, mean- 
ing that this report does not capture precisely the period respondents were 
asked to recall. Since homicide statistics do not typically change drastically 
on a national scale over the course of a year, we expect, nonetheless, that 
these figures are reasonable for comparison. In all six age-gender categories, 
the UCR and CDC estimates are within the middle 50% of the posterior
distribution of our estimates [computed by multiplying $h(a,k)$ by the number
of individuals in the given age-gender group]. For males 20-40, for example,
the UCR counts approximately 5300 murders while the CDC counts just
under 7300 homicides. Our method estimates the first quartile of the pos- 
terior distribution as roughly 300 murders and the third quartile as around 
7300. Similarly, for females between 40 and 60 the middle half of our pos- 
terior lies between around 100 and 2300 while the UCR records around 700 
and the CDC counts about 1900. Overall our estimates underrepresent the 
disparity in the proportion of male and female homicide victims, which we 
believe is due to the individuals who are most likely associated with mur- 
dered individuals being underrepresented in the survey frame. McCormick 
et al. (2009) found a similar issue in an internet survey. 

Our estimates for women who were raped in the past year reveal a common 
issue with ARD questions. Though the question asks respondents to recall
only women who were raped, we hypothesize respondents will include men 
who are connected to a woman who was raped, even if the woman does 
not meet the definition of a tie. Respondents may also be likely to over- 
recall such traumatic events. Similarly, our estimates for female suicides are 
consistently higher than for males (though the difference is well within the 
uncertainty of measurements). Males are actually nearly four times as likely 
to commit suicide as females [Centers for Disease Control and Prevention 
(2011)]. This discrepancy might be because of the isolation of many suicide 
victims before their deaths, making them difficult to reach with ARD. Our 
estimates are especially consistent with the case of males being more isolated 
than females before committing suicide. 

Recent work has used ARD for estimating population-level social
phenomena outside the context of hard-to-reach groups [DiPrete et al. (2011)].
To demonstrate the applicability of latent profile estimation in this context, 
Figure 2 shows the latent profile of individuals who opened a small busi- 
ness in the past year. The trend across ages in the profiles for males and 
females is similar, with most new business openers being younger adults [Of- 
fice of Advocacy, U.S. Small Business Administration (1997)]. The fraction 
of males opening a business is consistently higher, however. This discrep- 
ancy is especially pronounced among young adults, the group with highest 
overall propensity. 

Overall, our estimates of latent profiles are similar to estimates from other 
sources for the U.S. population. The similarity between previous knowledge 
about the profiles of these populations and our estimates indicates that 
ARD contain a significant amount of information about the latent structure 
of these subpopulations. The estimates presented in this section were ob- 
tained using the MCMC algorithm described in Section 3. In the following 
section we present an alternative regression-based estimation strategy which 
is significantly less time-consuming to implement and provides comparable 
performance when certain conditions are satisfied.

Fig. 2. Estimates of latent profiles for individuals starting their own business. Letters
correspond to posterior medians, while lines represent the width of the middle half of 
the posterior distribution. The lighter text represents males and the darker text females. 
Overall estimates are higher for males than for females, with the largest discrepancy for 
young adults. 



5. Simple calculations and design recommendations. Given data from 
an existing survey, we have shown that our method will recover features 
of unobserved subpopulation profiles. We propose an alternative strategy 
to recover this information under certain conditions without using MCMC. 
Our simple method combines estimation and survey-design strategy, mak- 
ing it well-suited for researchers who intend to collect ARD. McCormick, 
Salganik and Zheng (2010) proposed the scaled-down condition for selecting 
subpopulations to reduce bias in simple estimates of respondent degree. To 
estimate latent profiles, we need accurate degree and mixing matrix esti- 
mates. To accurately estimate the mixing matrix, we introduce a missing 
data perspective for ARD and propose an estimator based on the EM algo- 
rithm. 

In Section 5.1 we review degree estimation and the scaled-down condition. 
Next, Section 5.2 describes a simple ratio estimator for the mixing matrix 
motivated by the EM algorithm. We then describe a regression-based
estimator for latent profiles in Section 5.3 and demonstrate its effectiveness
through simulation studies in Section 5.4. 

5.1. Estimating degree. In this section we review work on estimating 
respondent degree by McCormick, Salganik and Zheng (2010). We use these 
estimates in subsequent sections to estimate mixing rates and latent profiles. 

McCormick, Salganik and Zheng (2010) develop a degree estimator based 
on the scale-up method of Killworth et al. (1998b). This approach uses
respondents' answers to ARD questions and recalibrates based on the
proportion of the total population comprised of the populations used on the survey.
For example, if a respondent reports knowing 3 women who gave birth, this
represents about one-millionth of all women who gave birth within the last
year. This information then could be used to estimate that the respondent
knows about one-millionth of all Americans, (3/3.6 million) · (300 million) ≈
250 people.

The precision of this estimate can be increased by averaging responses of
many groups, yielding the scale-up estimator [Killworth et al. (1998b)],

$$\hat{d}_i = \frac{\sum_{k=1}^{K} y_{ik}}{\sum_{k=1}^{K} N_k} \cdot N,$$

where $y_{ik}$ is the number of people that person $i$ knows in subpopulation $k$,
$N_k$ is the size of subpopulation $k$, and $N$ is the size of the population.
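As a minimal sketch of the scale-up calculation described above (total number known across the surveyed subpopulations, divided by the share of the population those subpopulations represent), the following Python function may help. The function name and argument names are illustrative assumptions, not part of the original method's notation.

```python
def scale_up_degree(y_i, sizes, population_size):
    """Scale-up estimate of one respondent's degree.

    y_i             -- reported counts y_ik, one per subpopulation k
    sizes           -- known subpopulation sizes N_k, in the same order
    population_size -- total population size N
    (All names here are hypothetical, chosen for illustration.)
    """
    if len(y_i) != len(sizes):
        raise ValueError("y_i and sizes must have the same length")
    # sum_k y_ik divided by (sum_k N_k / N)
    return sum(y_i) * population_size / sum(sizes)


# Single-group example from the text: 3 women who gave birth, out of
# 3.6 million such women, in a population of 300 million.
print(scale_up_degree([3], [3_600_000], 300_000_000))  # 250.0
```

With several subpopulations, the same call pools the counts and sizes before scaling, which is what reduces the variance relative to a single-group estimate.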

The scale-up estimator is easy to compute, yet it can induce substantial bias
if subpopulations are not selected carefully. The scale-up estimator assumes
random mixing across the K populations, that is, that the propensity for
an individual to know members of a subpopulation depends only on the size
of the subpopulation. In practice, this is rarely the case: individuals tend
to know more alters who are demographically similar to themselves.

McCormick, Salganik and Zheng (2010) derived a scaled-down condition
for selecting names so that the collection of individuals with first names that
are used to collect ARD constitutes a balanced and representative sample of
the population. In other words, the combined demographic profiles of the
used first names match those of the general population. Specifically,

$$\frac{\sum_{k=1}^{K} N_{ak}}{N_a} = \frac{\sum_{k=1}^{K} N_k}{N} \quad \text{for all } a \in \{1, \ldots, A\}.$$

Using the scaled-down condition, McCormick, Salganik and Zheng (2010) 
demonstrate that the scale-up estimator produces reduced-bias estimates of 
degree. In deriving the subsequent latent profile estimates, we assume we 
have selected subpopulations which satisfy the scaled-down condition. 
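One way to check a candidate set of subpopulations against the scaled-down condition is sketched below. It assumes the condition is read as: within every alter group, the share of the group covered by the selected subpopulations equals the share of the whole population they cover. It also assumes that the alter groups partition the population, so that N is the sum of the alter group sizes and each subpopulation size is the sum of its counts across alter groups. The function name and the example counts are hypothetical.

```python
def scaled_down_gaps(n_ak, alter_sizes):
    """For each alter group a, return (sum_k N_ak / N_a) - (sum_k N_k / N).

    n_ak        -- n_ak[a][k] = members of subpopulation k in alter group a
    alter_sizes -- alter_sizes[a] = N_a, the size of alter group a
    A gap of zero for every alter group means the condition holds.
    """
    total_population = sum(alter_sizes)              # N (assumes a partition)
    total_selected = sum(sum(row) for row in n_ak)   # sum_k N_k
    overall_share = total_selected / total_population
    return [sum(row) / n_a - overall_share
            for row, n_a in zip(n_ak, alter_sizes)]


# Hypothetical counts: two alter groups of sizes 1000 and 3000.
balanced = [[10, 20], [60, 30]]    # covers 3% of each alter group
skewed = [[50, 40], [10, 20]]      # over-represents the first alter group
print(scaled_down_gaps(balanced, [1000, 3000]))
print(scaled_down_gaps(skewed, [1000, 3000]))
```

In practice the gaps will rarely be exactly zero, so a tolerance would have to be chosen; the sketch leaves that choice to the user.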

5.2. A simple ratio estimator of individual mixing rates. If, for a given
respondent $i$, we could take all the members of the social network with
whom $i$ has a link and place them in a room, we could compute the mixing
rate between the ego and a given alter group, $a \in \{1, \ldots, A\}$, by dividing the
room into $A$ mutually exclusive sections and asking alters to stand in the
section for their group. The estimated mixing rate would then be the number of
people standing in a given section divided by the number of people in the
room.

We could also perform a similar calculation by placing a simple random
sample of size $n$ from a population of size $N$ in a room. Then, after dividing
the alters into mutually exclusive groups, we could count $y_{ia}$, the number
of alters respondent $i$ knows in the sample who are in each of the $A$ alter
groups. Since we have a simple random sample, we can extrapolate back to
the population and estimate the degree of the respondent, $d_i$, and the within
alter group degree, $d_{ia}$, as

$$\hat{d}_i = \sum_{a=1}^{A} y_{ia} \Big/ (n/N) \quad \text{and} \quad \hat{d}_{ia} = y_{ia} \Big/ (n_a/N_a).$$

Given these two quantities, we can estimate the mixing rate between the 
respondent and an alter group by taking the ratio of alters known in the 
sample who are in alter group a over the total number known in the sample. 
This computation is valid because we assumed a simple random sample, so
that (in expectation) the demographic distribution of alters in our sample
matches that of the population.

In ARD, the distribution of the hypothetical alters we sample depends
on the subpopulations we select. If we ask respondents only about subpopulations
which consist of young males, for example, then our hypothetical room from
the previous example would contain only the respondent's young, male
alters. Estimating the rate of mixing between the respondent and older females
would not be possible in this situation. Viewed in this light, ARD are a form
of cluster sampling where the subpopulations are the clusters and respondents
report the presence/absence of a tie with each alter in the cluster.
Since the clusters are no longer representative of the population, our estimates
need to be adjusted for the demographic profiles of the clusters [Lohr
(1999)]. Specifically, if we observe $y_{ika}$ for subpopulations $k = 1, \ldots, K$ and
alter groups $a = 1, \ldots, A$, then our estimates of $d_i$ and $d_{ia}$ become

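$$\hat{d}_i = \sum_{k=1}^{K} y_{ik} \Big/ \left( \sum_{k=1}^{K} N_k / N \right) \quad \text{and} \quad \hat{d}_{ia} = \sum_{k=1}^{K} y_{ika} \Big/ \left( \sum_{k=1}^{K} N_{ak} / N_a \right),$$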

where $N_k$ is the size of subpopulation $k$ and $N_{ak}$ is the number of members of
subpopulation $k$ in alter group $a$. To estimate the mixing rate, we could again
divide the estimated number known in alter group a by the total estimated 
number known. Under the scaled-down condition the denominators in the 
above expressions cancel and the mixing estimate is the number known in 
the subpopulations that are in alter group a over the total number known 
in all K subpopulations. 
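The cancellation described above can be written out in one line. The derivation below is a sketch that assumes the cluster-adjusted estimates take the form of the number known divided by the sampled share (overall for the degree, within alter group $a$ for the within-group degree), and that the scaled-down condition equates those two shares for every alter group.

```latex
\hat{m}_{ia}
  = \frac{\hat{d}_{ia}}{\hat{d}_i}
  = \frac{\sum_{k=1}^{K} y_{ika} \Big/ \left( \sum_{k=1}^{K} N_{ak}/N_a \right)}
         {\sum_{k=1}^{K} y_{ik} \Big/ \left( \sum_{k=1}^{K} N_k/N \right)}
  = \frac{\sum_{k=1}^{K} y_{ika}}{\sum_{k=1}^{K} y_{ik}}
    \cdot
    \underbrace{\frac{\sum_{k=1}^{K} N_k/N}{\sum_{k=1}^{K} N_{ak}/N_a}}_{=\,1
      \text{ under the scaled-down condition}}
  = \frac{\sum_{k=1}^{K} y_{ika}}{\sum_{k=1}^{K} y_{ik}} .
```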

In the examples above, we have assumed the alters are observed so that
$y_{ika}$ can be computed easily. This is not the case in ARD, however, since we
observe only the aggregate number of ties and not the specific demographic 
makeup of the recipients. Thus, ARD are a cluster sample where the specific 
ties between the respondent and members of the alter group are missing. 

If we ignore the residual variation in propensity to form ties with group $k$
individuals due to noise [see (3.1) in Section 3], we may assume that the number
of members of subpopulation $k$ in alter group $a$ the respondent knows,
$y_{ika}$, follows a Poisson distribution. Under this assumption, we can estimate
$m_{ia}$ by imputing $y_{ika}$ as part of an EM algorithm [Dempster, Laird and Rubin
(1977)]. Specifically, for each individual define $y_{ik}^{(com)} = (y_{ik1}, \ldots, y_{ikA})^T$
as the complete data vector for each alter group. The complete data log-likelihood
for individual $i$'s vector of mixing rates, $m_i = (m_{i1}, \ldots, m_{iA})^T$, is
$\ell(m_i; y_{i1}^{(com)}, \ldots, y_{iK}^{(com)})$, which has the form

$$\ell(m_i; y_{i1}^{(com)}, \ldots, y_{iK}^{(com)}) = \sum_{k=1}^{K} \sum_{a=1}^{A} \log \mathrm{Poisson}\left( y_{ika};\; \lambda_{ika} = d_i m_{ia} \frac{N_{ak}}{N_a} \right). \tag{5.1}$$
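Using (5.1), we derive the following two updating steps for the EM:

$$y_{ika}^{(t)} = y_{ik} \, \frac{m_{ia}^{(t-1)} N_{ak}/N_a}{\sum_{a'=1}^{A} m_{ia'}^{(t-1)} N_{a'k}/N_{a'}} \quad \text{and} \quad m_{ia}^{(t)} = \frac{\sum_{k=1}^{K} y_{ika}^{(t)}}{\sum_{k=1}^{K} y_{ik}}.$$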

If one sets $m_{ia}^{(0)} = N_a/N$, which corresponds to random mixing in the population,
and runs one EM update, this would result in the following simple
ratio estimator of the mixing rate for individual $i$:

$$\hat{m}_{ia} = \frac{\sum_{k=1}^{K} y_{ik} N_{ak}/N_k}{\sum_{k=1}^{K} y_{ik}}. \tag{5.2}$$

In our simulation studies, this simple estimator produces estimates very
close to the converged EM estimates. Additionally, it is easy to show that the
simple ratio estimate, $\hat{m}_{ia}$, is unbiased if $N_{ak}/N_a \neq 0$ for only one alter group
$a$ and if, for any $a$, there exists a subpopulation, $k$, such that $N_{ak} = N_a$. We
refer to this condition as complete separability. Therefore, (5.2) constitutes
a simple estimate for individual mixing rate and can be used to estimate
average mixing behaviors of any ego group.
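The relationship between the EM updates and the simple ratio estimator can be illustrated with a short Python sketch. It rests on a reading of the Poisson model in which the expected number known in subpopulation $k$ and alter group $a$ is proportional to $m_{ia} N_{ak}/N_a$: the E-step splits each observed count $y_{ik}$ across alter groups in that proportion, and the M-step normalises the imputed counts, which is the maximiser only when the scaled-down condition holds. The function names, argument names, and example numbers are hypothetical, and this is not the authors' code.

```python
def em_mixing(y_i, n_ak, alter_sizes, n_iter=1):
    """EM-style updates for one respondent's mixing rates m_ia.

    y_i         -- observed ARD counts y_ik, one per subpopulation k
    n_ak        -- n_ak[a][k] = members of subpopulation k in alter group a
    alter_sizes -- alter_sizes[a] = N_a (assumed to sum to N)
    n_iter      -- number of updates; one update from the random-mixing
                   start reproduces the simple ratio estimate
    """
    n_alter, n_sub = len(n_ak), len(y_i)
    total_known = sum(y_i)
    if total_known == 0:
        raise ValueError("respondent reports no ties; mixing rate undefined")
    total_population = sum(alter_sizes)
    m = [n_a / total_population for n_a in alter_sizes]  # start at N_a / N
    for _ in range(n_iter):
        imputed = [0.0] * n_alter
        for k in range(n_sub):
            # E-step: allocate y_ik in proportion to m_ia * N_ak / N_a.
            weights = [m[a] * n_ak[a][k] / alter_sizes[a] for a in range(n_alter)]
            norm = sum(weights)
            if norm == 0:
                continue  # subpopulation k has no members in any alter group
            for a in range(n_alter):
                imputed[a] += y_i[k] * weights[a] / norm
        # M-step: normalise the imputed counts across alter groups.
        m = [v / total_known for v in imputed]
    return m


def simple_ratio(y_i, n_ak):
    """Simple ratio estimate: sum_k y_ik * N_ak / N_k over sum_k y_ik."""
    n_sub = len(y_i)
    sizes = [sum(row[k] for row in n_ak) for k in range(n_sub)]  # N_k
    total_known = sum(y_i)
    return [sum(y_i[k] * row[k] / sizes[k] for k in range(n_sub)) / total_known
            for row in n_ak]


y = [4, 6]
profiles = [[10, 20], [60, 30]]
print(em_mixing(y, profiles, [1000, 3000], n_iter=1))
print(simple_ratio(y, profiles))
```

Raising `n_iter` iterates the same two steps, which gives a rough way to see how far the one-step estimate is from the converged value for a given respondent.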

5.3. Regression-based estimates for latent profiles. The estimates of respondent
degree and mixing rates rely on latent profile information from
some populations. Using these estimates, we now develop a regression-based
estimator for unobserved latent profiles. For each respondent and each unknown
subpopulation we now have

$$y_{ik} = \sum_{a=1}^{A} \hat{d}_i \hat{m}_{ia} h(a, k). \tag{5.3}$$

If we denote by $X_k$ the $n \times A$ matrix with elements $\hat{d}_i \hat{m}_{ia}$ and the vector
$h(\cdot, k) = \beta_k$, then (5.3) can be regarded as a linear regression equation,
$y_k = X_k \beta_k$, with the constraint that the coefficients, $\beta_k$, are restricted to be
nonnegative. Lawson and Hanson (1974) propose an algorithm for computing
these coefficients. Since the $\hat{m}_{ia}$ sum to one across alter groups, the
columns of $X_k$ are collinear. This could produce instability in solving the
quadratic programming problem associated with finding our estimated latent
profiles. In practice, we have found our estimates perform well despite
this feature.
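To make the constrained regression concrete, the sketch below solves a small nonnegative least squares problem in plain Python. It uses projected gradient descent, which is a simpler substitute for the active-set algorithm of Lawson and Hanson (1974) and is shown only to illustrate the constraint; it can be slow when the design matrix is badly conditioned. The function name and the toy design matrix are hypothetical.

```python
def nnls_projected_gradient(x, y, n_iter=5000):
    """Minimise ||y - x b||^2 subject to b >= 0 by projected gradient descent.

    x -- design matrix as a list of rows (here, entries d_i * m_ia)
    y -- response vector (here, counts y_ik for one unknown subpopulation)
    Not the Lawson-Hanson algorithm; an illustrative stand-in.
    """
    n_rows, n_cols = len(x), len(x[0])
    # trace(x'x) bounds the largest eigenvalue of x'x, so 1/trace is a
    # safe (if conservative) step size. Assumes x is not all zeros.
    step = 1.0 / sum(v * v for row in x for v in row)
    b = [0.0] * n_cols
    for _ in range(n_iter):
        resid = [sum(row[a] * b[a] for a in range(n_cols)) - y[i]
                 for i, row in enumerate(x)]
        grad = [sum(x[i][a] * resid[i] for i in range(n_rows))
                for a in range(n_cols)]
        # Gradient step followed by projection onto the nonnegative orthant.
        b = [max(0.0, b[a] - step * grad[a]) for a in range(n_cols)]
    return b


# Toy problem where the unconstrained solution has a negative coefficient,
# so the constraint is active for the second coefficient.
design = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
response = [4.0, 0.0, 3.0]
print(nnls_projected_gradient(design, response))
```

For real survey data a tested library routine for nonnegative least squares would normally be preferable to this hand-rolled loop.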

5.4. Simulation experiments. We present simulation experiments to evaluate
our regression-based estimates under four strategies for selecting observed
profiles. First, we created profiles which are completely separable.
Second, we constructed profiles for the names satisfying the scaled-down
condition presented in McCormick, Salganik and Zheng (2010) using data
from the Social Security Administration. These names provide insights into
the potential accuracy of our method using actual profiles. As a third case,
we include the names from McCormick, Salganik and Zheng (2010) which
violate the scaled-down condition and are almost exclusively popular among
older respondents. For the fourth set of names, recall from Section 3 that
the mixing matrix estimates are identifiable only if the matrix of known
profiles, $H_{A \times K}$, has rank $A$. To demonstrate a violation of this condition,
we selected a set of names with uniform popularity across the demographic
groups, or nearly perfect collinearity. There is some correlation in the
scaled-down names since several names have similar profiles. The degree of
correlation is substantially less than in the flat profiles, however.

In each simulation, we generated 500 respondents using the Latent Nonrandom
Mixing Model in (3.1) [see also McCormick, Salganik and Zheng
(2010)] with each of the four profile strategies. Mixing matrix estimates
were calculated using the simple estimate derived from the first step of the
EM algorithm in Section 5.2. We compare our mixing matrix estimates to
the estimated mixing matrix from McCormick, Salganik and Zheng (2010),
which we use to generate the simulated data. We evaluate the latent profiles
using six names with profiles known from the Social Security Administration.
We repeated the entire process 1000 times. Figure 3 presents boxplots
of the squared error in mixing matrix and latent profile estimates. In both
cases, the ideal, completely separable, profiles have the lowest error. The
scaled-down names also perform well, indicating that reasonable estimates
are possible even when complete separability is not. The flat profiles perform
only slightly worse than the scaled-down names for estimating mixing but
significantly worse when estimating latent profiles. The names which violate
the scaled-down condition produce poor estimates of both quantities.

Fig. 3. Total mean squared error across all elements of the mixing matrix and latent
profile matrix. The vertical axis is the sum of the errors across all eight alter groups. We
generated 500 respondents using the four profile structures, then evaluated our ability to
recover the mixing matrix estimated in McCormick, Salganik and Zheng (2010) and the
known profiles of six additional names. We repeated the simulation 1000 times. In both
cases the ideal profile has the lowest error, followed by the scaled-down names suggested
by McCormick, Salganik and Zheng (2010).
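A much-reduced version of such a simulation can be sketched in a few lines of Python. The setup below is hypothetical and far simpler than the experiments reported here: it uses two alter groups, two completely separable subpopulations that satisfy the scaled-down condition, a common degree for all respondents, and plain Poisson counts rather than the overdispersed Latent Nonrandom Mixing Model. It only illustrates that, in this idealised case, the averaged simple ratio estimate recovers the mixing rates used to generate the data.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_separable(n_respondents=2000, degree=300.0, seed=0):
    """Average the simple ratio estimate over simulated respondents.

    All numbers below are assumed values chosen for illustration.
    Subpopulation k lies entirely inside alter group a = k.
    """
    rng = random.Random(seed)
    true_m = [0.3, 0.7]            # assumed true mixing rates m_ia
    alter_sizes = [1000, 3000]     # assumed alter group sizes N_a
    n_ak = [[30, 0], [0, 90]]      # n_ak[a][k]; covers 3% of each group
    sums, used = [0.0, 0.0], 0
    for _ in range(n_respondents):
        y = []
        for k in range(2):
            # Expected count: d_i * sum_a m_ia * N_ak / N_a
            lam = degree * sum(true_m[a] * n_ak[a][k] / alter_sizes[a]
                               for a in range(2))
            y.append(poisson(rng, lam))
        total = sum(y)
        if total == 0:
            continue  # ratio estimate undefined when no ties are reported
        # With complete separability N_ak / N_k is 0 or 1, so the simple
        # ratio estimate for alter group a is y_ik / sum_k y_ik with k = a.
        for a in range(2):
            sums[a] += y[a] / total
        used += 1
    mean_estimate = [s / used for s in sums]
    squared_error = sum((e - t) ** 2 for e, t in zip(mean_estimate, true_m))
    return mean_estimate, squared_error


estimate, error = simulate_separable()
print(estimate, error)
```

Replacing `n_ak` with profiles that overlap across alter groups, or that violate the scaled-down condition, gives a rough way to explore how the estimate degrades, though the results would not be comparable to Figure 3.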

6. Conclusion. We present a method for estimating latent profiles in 
hard-to-reach groups using standard surveys. Our method has two stages. 
First, we use known profiles for some populations to estimate respondent 
degree and the rate of mixing between survey respondents and groups in the 
population. Next, conditional on these estimates, we infer latent structure 
in populations where profiles are unknown. For existing data, we present a 
Bayesian hierarchical model and MCMC algorithm. We also propose viewing 
ARD in the context of missing data and provide a simple ratio estimate of 
mixing rates based on the EM algorithm. We then describe a regression- 
based estimate for latent profiles. 

Despite its utility, there are several known issues with ARD. Using ARD 
in hard-to-reach populations presents special challenges which intersect with 
these known issues. Many events in this context are especially traumatic, 
leaving a more persistent signal in the respondent's memory than a typi- 
cal tie. This phenomenon causes respondents to over-count their ties with 
a specific subpopulation. In Section 4 we contend, for example, that our overestimate
of the proportion of males in the subpopulation of women who were raped in the
past year arises because respondents count males who are associated
with females who have been raped. This issue is in some sense
the opposite of that faced by early ARD surveys for degree estimation when 
the concern was respondents under-recalling acquaintances from large pop- 
ulations [Killworth et al. (2003)]. Hard-to-reach groups are also often more 
open to interpretation than standard subpopulations. McCarty et al. (2001)
give the examples of people opening their own business and the homeless.
While there is some ambiguity in whether or not an individual has
opened a new business, there is likely much greater variability between re- 
spondents in their classification of an individual as homeless. Hard-to-reach 
groups are also often associated with social stigma. This stigma increases
the likelihood that a respondent will know a member of a subpopulation but
not be aware that the alter belongs to the subpopulation, a problem known as
transmission error. Recent work by Salganik et al. (2011) offers new insights into
the magnitude of transmission errors in the context of HIV/AIDS, though
the nature of the error likely depends heavily on the specific group of
interest (respondents' decisions to reveal HIV status are likely quite different
from their decisions to discuss diabetes, e.g.).

This method also makes an assumption that the networks of the respon- 
dents are representative of networks of similar individuals in the population. 
In Section 4, another possible explanation for our estimated ratio of males
to females who commit suicide is that our survey does not include
enough individuals who are likely to know people who commit suicide. This
bias could be present in the networks of respondents even if the sample of
respondents is itself observably representative. This point demonstrates the
potential for future work in modeling bias that comes not from the respon- 
dents selected, but from the features of the networks of these respondents. 
This type of sampling bias is related to previous work by Lavallee (2007) 
and could prove a promising area for future work. 

Our method demonstrates that ARD capture aspects of latent social 
structure through indirect observations of the social network. To do this, 
however, we require known profiles for some subpopulations. This require- 
ment limits the estimable latent profiles to features which are known for 
some subpopulation. In our examples we use first names and estimate age 
and gender profiles. We may, for example, be interested in the race/ethnic 
profiles of the hard-to-reach populations. We are unable to estimate this 
from our current data because of the issues with obtaining demographic 
profiles for first names mentioned in Section 4. An alternative approach, 
and direction for potential future work, would be estimating a geometric, 
multidimensional latent social space based on features of the actors and the 
social network [Hoff, Raftery and Handcock (2002), Hoff (2005)]. Such a 
technique would provide a sense of the broad topography of the network 
(similar to Bayesian multi-dimensional scaling) and elucidate similarities 
between network structure in hard-to-reach groups. 

An additional direction for future work involves combining information
from ARD with other forms of data collection to better understand hard- 
to-reach groups. As mentioned in Section 1, RDS provides detailed infor- 
mation about a biased sample of members of the hard-to-reach group. This 
detailed information is in contrast to the indirect, general information ob- 
tained through ARD. The missing-data framework presented in Section 5 
provides a first step toward a general framework for combining information
across various network-based data collection strategies. 